
Module 9: Observability and Monitoring

Developing Serverless Solutions on AWS

Topics

Monitoring Evolves into Observability

Analogy: Car dashboard vs mechanic diagnostic tool. Monitoring = checking the speedometer and fuel gauge (known metrics). Observability = plugging in a diagnostic tool that tells you WHY the engine light is on.
Monitoring (was) | Observability (is)
Watching layers of your stack | Visibility into the system as a whole
Identifying failures via probing | Understanding if system behaves as expected
Focusing on metrics | Gaining insight: usage, experience, trends, root cause
Reactive (alert when broken) | Proactive (understand before it breaks)

Distributed serverless applications make observability even more critical - no servers to SSH into, many small components.

Three Pillars of Observability

LOGS | What happened? | Stream of timestamped events | CloudWatch Logs | Debug & audit
TRACES | Where was the bottleneck? | End-to-end request journey | AWS X-Ray | Latency & errors
METRICS | How is it performing? | Numeric data over time | CloudWatch Metrics | Alerts & dashboards

Amazon CloudWatch Logs

API Gateway Logging

Logs = security camera footage. Records everything that happened. Useful for investigation but you need to search through it.

Structured Logging Best Practices

# Python - structured JSON logging (recommended!)
import json

message = {
    "level": "INFO",
    "timestamp": "2024-12-11T12:44:40.300Z",
    "requestId": "abc-123",
    "orderId": "ORD-456",
    "action": "AddToCart",
    "quantity": 2,
    "productId": "a23390f3",
    "environment": "prod"
}
print(json.dumps(message))

# CloudWatch Logs Insights can then query:
# fields @timestamp, orderId, action
# | filter level = "ERROR"
# | sort @timestamp desc
# | limit 20
Unstructured logs = handwritten notes. Structured JSON logs = spreadsheet. Much easier to search, filter, and aggregate.
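
The structured-logging pattern above can be wrapped in a small helper so every log line carries the same core fields. This is a minimal sketch using only the standard library; the helper name `log_event` and the extra `reason` field are illustrative assumptions, not part of the course material.

```python
import json
from datetime import datetime, timezone


def log_event(level, action, **fields):
    """Build one JSON log line, print it, and return it.

    In Lambda, anything printed to stdout is forwarded to CloudWatch Logs,
    where Logs Insights can filter on the individual JSON keys.
    """
    record = {
        "level": level,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "action": action,
        **fields,  # caller-supplied context such as orderId or quantity
    }
    line = json.dumps(record, default=str)  # default=str guards non-JSON types
    print(line)
    return line


# Hypothetical usage: an ERROR line that the Logs Insights filter above would match
log_event("ERROR", "AddToCart", orderId="ORD-456", reason="out_of_stock")
```

Keeping `level` and `action` as fixed positional arguments means every line has the keys the sample query filters and displays.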

AWS X-Ray - Distributed Tracing

X-Ray Service Map (animated request flow): API Gateway (12 ms), Lambda (85 ms), DynamoDB (8 ms), SQS (3 ms), Lambda 2 (45 ms), SNS (5 ms)

Instrumenting with X-Ray SDK

# Python - instrument your Lambda with X-Ray
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch all AWS SDK clients to capture traces automatically
patch_all()

def handler(event, context):
    # All boto3 calls are now traced automatically!
    # Add custom subsegments for your own code:
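    subsegment = xray_recorder.begin_subsegment('process_order')
    try:
        # Annotate first so the trace stays searchable by orderId even if processing fails
        subsegment.put_annotation('orderId', event['orderId'])
        result = process_order(event)  # process_order: your own business logic (not shown)
        subsegment.put_metadata('result', result)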
    finally:
        xray_recorder.end_subsegment()
    return result
X-Ray = GPS tracking on a package. You can see exactly where it went, how long each stop took, and where it got stuck.

Amazon CloudWatch Metrics

Key Lambda Metrics to Monitor

Metric | What It Tells You | Alarm On
Errors | Failed invocations | Any increase above 0
Duration | How long function runs | Approaching timeout
Throttles | Concurrency limit hit | Any occurrence
ConcurrentExecutions | Active instances | Near account limit
IteratorAge | Stream processing lag | Growing over time
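
As one way to turn the "Alarm On" guidance for Errors into an actual alarm, the sketch below builds the arguments for CloudWatch's `put_metric_alarm` call. The function names, the alarm naming scheme, the 60-second period, and the choice to treat missing data as healthy are assumptions made for illustration; adjust them to your own conventions.

```python
def errors_alarm_params(function_name, sns_topic_arn):
    """Keyword arguments for put_metric_alarm: alarm when Errors > 0 in one minute."""
    return {
        "AlarmName": f"{function_name}-errors",  # naming scheme is an assumption
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,                 # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations counts as healthy
        "AlarmActions": [sns_topic_arn],
    }


def create_errors_alarm(function_name, sns_topic_arn):
    """Create the alarm. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works offline

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**errors_alarm_params(function_name, sns_topic_arn))
```

Separating the parameter builder from the API call keeps the alarm definition easy to review and unit test without touching an AWS account.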

Custom Metrics & Embedded Metrics Format

# Embedded Metrics Format (EMF) - generate metrics from logs!
# No putMetricData API call needed - just print structured JSON

import json
from datetime import datetime

metric_log = {
    "_aws": {
        "Timestamp": int(datetime.now().timestamp() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp/Orders",
            "Dimensions": [["Environment", "Region"]],
            "Metrics": [
                {"Name": "OrderCount", "Unit": "Count"},
                {"Name": "OrderValue", "Unit": "None"}
            ]
        }]
    },
    "Environment": "prod",
    "Region": "us-west-2",
    "OrderCount": 1,
    "OrderValue": 149.99
}
print(json.dumps(metric_log))
# CloudWatch automatically extracts OrderCount and OrderValue as metrics!
EMF = automatic expense reporting. You write your receipt (log), and the system automatically categorizes and totals your spending (metric) - no separate submission needed.
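
The EMF document shown above has a repetitive shape, so it can be generated by a small function. The following is a sketch under stated assumptions: `emf_line` is a hypothetical helper name, it emits a single dimension set, and it uses only the standard library rather than an AWS-provided EMF client library.

```python
import json
import time


def emf_line(namespace, dimensions, metrics):
    """Return one EMF-formatted JSON log line.

    dimensions: {dimension_name: value}
    metrics:    {metric_name: (value, unit)}
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set: all keys
                "Metrics": [
                    {"Name": name, "Unit": unit}
                    for name, (_value, unit) in metrics.items()
                ],
            }],
        },
    }
    # Dimension values and metric values live at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _unit) in metrics.items()})
    return json.dumps(record)


# Same data as the hand-written example above
print(emf_line(
    "MyApp/Orders",
    {"Environment": "prod", "Region": "us-west-2"},
    {"OrderCount": (1, "Count"), "OrderValue": (149.99, "None")},
))
```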

CloudWatch Lambda Insights

Key Insight Metrics

Metric | Why It Matters
memory_utilization | Are you over/under-provisioned?
init_duration | How bad are cold starts?
cpu_total_time | Is function CPU-bound?
rx/tx_bytes | Network I/O bottlenecks
Lambda Insights = fitness tracker for your functions. Tells you heart rate (CPU), energy burned (memory), sleep quality (cold starts).

Bringing the Three Pillars Together

X-Ray Trace Map = One-stop troubleshooting
Logs | What happened
Traces | Where it slowed down
Metrics | How it's trending
CloudWatch integrates all three into a unified view for troubleshooting.

The X-Ray trace map integrates X-Ray and CloudWatch, providing access to logs, metrics, and alarms from a single interface.
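
One common way to make logs and traces easier to cross-reference is to stamp each structured log line with the current X-Ray trace ID. The sketch below assumes the function runs in Lambda with tracing enabled, where the runtime exposes the trace header in the `_X_AMZN_TRACE_ID` environment variable; the helper names and the `traceId` field name are illustrative choices, not something the course prescribes.

```python
import json
import os


def current_trace_id(env=os.environ):
    """Extract the Root trace ID from the X-Ray trace header, or None if absent.

    The header looks like: Root=1-5759e988-bd86...;Parent=5399...;Sampled=1
    """
    header = env.get("_X_AMZN_TRACE_ID", "")
    parts = dict(p.split("=", 1) for p in header.split(";") if "=" in p)
    return parts.get("Root")


def log_with_trace(action, env=os.environ, **fields):
    """Build a JSON log line that carries the trace ID alongside the event fields."""
    record = {"level": "INFO", "action": action,
              "traceId": current_trace_id(env), **fields}
    return json.dumps(record)


# Outside Lambda the variable is usually unset, so traceId is null here.
print(log_with_trace("GetOrder", orderId="ORD-001"))
```

With the ID in every log line, a Logs Insights filter on `traceId` returns the log events for a trace you are inspecting in X-Ray.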

What's New (2024-2025)

Q1: What are the three pillars of observability?

B) Logs, Traces, Metrics
Logs = what happened (events). Traces = where it went (request path). Metrics = how it's performing (numbers over time).
A: Those are system resources. C: Those are CloudWatch features, not pillars. D: Those are security.

Q2: What does the Embedded Metrics Format (EMF) allow you to do?

B) Generate metrics from structured JSON logs
EMF lets you print a specially formatted JSON to stdout, and CloudWatch automatically extracts custom metrics from it - no putMetricData API call needed.
A: EMF goes to CloudWatch Metrics, not X-Ray. C: That's SnapStart/Provisioned. D: That's KMS encryption.

Q3: You notice a Lambda function occasionally takes 10x longer. Which tool best identifies the bottleneck?

B) AWS X-Ray traces
X-Ray shows the time spent in each segment/subsegment of a request. You can visually see which downstream call (DynamoDB, external API, etc.) caused the latency spike.
A: Logs tell you what happened but not latency breakdown. C: Alarms alert on thresholds but don't diagnose. D: Config tracks resource changes, not performance.

Q4: CloudWatch Logs retention is set to what by default?

C) Never expires (indefinite)
By default, CloudWatch Logs retains logs forever. This can get expensive! Best practice: Set a retention policy (e.g., 14 days for dev, 90 days for prod).
A/B/D: These are available options you can SET, but none is the default.
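
The retention best practice from Q4 can be applied in code with the CloudWatch Logs `put_retention_policy` call. This is a sketch only: the stage-to-days mapping mirrors the example values in the answer, the helper names are hypothetical, and the log group name assumes the function uses Lambda's default `/aws/lambda/<function-name>` group.

```python
# Example values from the answer above; tune per environment.
RETENTION_DAYS = {"dev": 14, "prod": 90}


def retention_params(function_name, stage):
    """Keyword arguments for put_retention_policy on a function's default log group."""
    return {
        "logGroupName": f"/aws/lambda/{function_name}",  # default group name assumed
        "retentionInDays": RETENTION_DAYS[stage],
    }


def apply_retention(function_name, stage):
    """Apply the policy. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works offline

    logs = boto3.client("logs")
    logs.put_retention_policy(**retention_params(function_name, stage))
```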

Live Demo: Observability in Action

Deploy a Lambda + API Gateway with full observability, then show logs, traces, and metrics.

Demo components: API Gateway, Lambda, DynamoDB, CloudWatch Logs, X-Ray Traces, CW Metrics

Demo Step 1: Deploy Function with X-Ray + Logging

Lambda Function (with structured logging + X-Ray)

# observability_demo.py
import json, os, time, boto3
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # Auto-trace all AWS SDK calls

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ.get('TABLE_NAME', 'demo-orders'))

def handler(event, context):
    start = time.time()
    # queryStringParameters can be present but null when no query string is sent
    params = event.get('queryStringParameters') or {}
    order_id = params.get('id', 'ORD-001')

    # Structured log (JSON)
    print(json.dumps({"level": "INFO", "action": "GetOrder", "orderId": order_id}))

    # Custom X-Ray subsegment
    with xray_recorder.in_subsegment('fetch_order') as subseg:
        subseg.put_annotation('orderId', order_id)
        item = table.get_item(Key={'orderId': order_id})

    duration = (time.time() - start) * 1000
    print(json.dumps({"level": "INFO", "action": "Complete", "duration_ms": round(duration)}))

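    # default=str: the DynamoDB resource returns numbers as Decimal, which json cannot serialize
    return {"statusCode": 200, "body": json.dumps(item.get('Item', {}), default=str)}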

Demo Step 2: Deploy with CLI

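# Package the handler together with the X-Ray SDK
# (aws-xray-sdk is not bundled in the Lambda Python runtime)
pip install aws-xray-sdk -t package/
cp observability_demo.py package/
(cd package && zip -r ../observability_demo.zip .)

# Create function with X-Ray enabled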

aws lambda create-function --function-name observability-demo \
  --runtime python3.12 --handler observability_demo.handler \
  --role arn:aws:iam::ACCOUNT:role/lambda-xray-role \
  --zip-file fileb://observability_demo.zip \
  --environment Variables='{TABLE_NAME=demo-orders}' \
  --tracing-config Mode=Active \
  --logging-config LogFormat=JSON,ApplicationLogLevel=INFO

# Create DynamoDB table
aws dynamodb create-table --table-name demo-orders \
  --attribute-definitions AttributeName=orderId,AttributeType=S \
  --key-schema AttributeName=orderId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

# Put test item
aws dynamodb put-item --table-name demo-orders \
  --item '{"orderId":{"S":"ORD-001"},"item":{"S":"Laptop"},"amount":{"N":"999"}}'

# Create HTTP API + invoke
aws apigatewayv2 create-api --name observability-demo-api --protocol-type HTTP
# (attach integration + route, then test)

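# Invoke directly to generate traces
# (AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload)
aws lambda invoke --function-name observability-demo \
  --cli-binary-format raw-in-base64-out \
  --payload '{"queryStringParameters":{"id":"ORD-001"}}' output.json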

Demo Step 3: What to Show in Console

Tool | What to Demonstrate
CloudWatch Logs | Show JSON structured logs, filter by orderId, run Logs Insights query
X-Ray Traces | Click service map, drill into trace, show Lambda + DynamoDB segments with latency
CW Metrics | Show Lambda Duration, Invocations, Errors graphs
Lambda Insights | Enable enhanced monitoring, show memory/CPU utilization
Alarms | Create alarm on Errors > 0, show SNS notification

Logs Insights Query to Run Live

fields @timestamp, action, orderId, duration_ms
| filter level = "INFO"
| sort @timestamp desc
| limit 20
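
The same query can also be run from a script instead of the console. The sketch below uses the CloudWatch Logs `start_query` and `get_query_results` calls; the helper names, the 15-minute window, and the one-second polling interval are assumptions made for illustration.

```python
import time

QUERY = """fields @timestamp, action, orderId, duration_ms
| filter level = "INFO"
| sort @timestamp desc
| limit 20"""


def start_query_params(log_group, minutes=15, now=None):
    """Keyword arguments for start_query covering the last `minutes` minutes."""
    end = int(now if now is not None else time.time())  # epoch seconds
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,
        "endTime": end,
        "queryString": QUERY,
    }


def rows_to_dicts(results):
    """Flatten get_query_results rows ([{field, value}, ...]) into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_query(log_group):
    """Run the query and poll until done. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work offline

    logs = boto3.client("logs")
    query_id = logs.start_query(**start_query_params(log_group))["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return rows_to_dicts(response["results"])
        time.sleep(1)  # polling interval is an arbitrary choice
```

For the demo function, `run_query("/aws/lambda/observability-demo")` would target the default log group, assuming no custom log group was configured.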

Module Summary